By: Victor Cornejo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import sqlalchemy
from pathlib import Path
plt.style.use('fivethirtyeight') # Use plt.style.available to see more styles
sns.set()
sns.set_context("talk")
np.set_printoptions(threshold=5) # avoid printing out big matrices
%matplotlib inline
In this project I took a look at how states voted in presidential elections between 1972 and 2020. My ultimate goal was to show how 2D PCA scatterplots could allow us to identify clusters in a high-dimensional dataset. For this example, that meant finding groups of states that voted similarly by plotting their 1st and 2nd principal components.
For this, I explored a dataset on U.S. Presidential Elections by State since 1789, as taken from Wikipedia.
df = pd.read_csv("data/presidential_elections.csv")
df.head(5)
| State | 1789 | 1792 | 1796 | 1800 † | Unnamed: 5 | 1804 | 1808 | 1812 | 1816 | ... | 1992 | 1996 | 2000 ‡ | Unnamed: 60 | 2004 | 2008 | 2012 | 2016 ‡ | 2020 | State.1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | R | R | R | NaN | R | R | R | R | R | Alabama |
| 1 | Alaska | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | R | R | R | NaN | R | R | R | R | R | Alaska |
| 2 | Arizona | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | R | D | R | NaN | R | R | R | R | D | Arizona |
| 3 | Arkansas | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | D | D | R | NaN | R | R | R | R | R | Arkansas |
| 4 | California | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | D | D | D | NaN | D | D | D | D | D | California |
5 rows × 67 columns
The data in the table was pretty messy (missing records, inconsistent field naming, etc.), so I created a clean version.
Below I created a clean table, which contained exactly 51 rows (corresponding to the 50 states plus Washington DC) and 13 columns (one for each of the election years from 1972 to 2020). The index of this dataframe was the state name.
df_clean = (
df.iloc[:, -15:]
.drop(['Unnamed: 60'], axis = 1)
.rename(columns = {"2000 ‡": "2000", "2016 ‡": "2016", "State.1": "State"})
.drop([51])
.set_index("State")
)
df_clean.head()
| 1972 | 1976 | 1980 | 1984 | 1988 | 1992 | 1996 | 2000 | 2004 | 2008 | 2012 | 2016 | 2020 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State | |||||||||||||
| Alabama | R | D | R | R | R | R | R | R | R | R | R | R | R |
| Alaska | R | R | R | R | R | R | R | R | R | R | R | R | R |
| Arizona | R | R | R | R | R | R | D | R | R | R | R | R | D |
| Arkansas | R | D | R | R | R | D | D | R | R | R | R | R | R |
| California | R | R | R | R | R | D | D | D | D | D | D | D | D |
I could see that each row represented a single state's outcome in every presidential election from 1972 to 2020, displaying the corresponding winner (Republican or Democrat).
To perform PCA, I needed to convert the data into numerical values. So I created a df_numerical dataframe that replaced all of the "D" characters in df_clean with the number 0, and all of the "R" characters with the number 1.
df_numerical = df_clean.replace("D", 0).replace("R", 1)
df_numerical.head()
| 1972 | 1976 | 1980 | 1984 | 1988 | 1992 | 1996 | 2000 | 2004 | 2008 | 2012 | 2016 | 2020 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State | |||||||||||||
| Alabama | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Alaska | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Arizona | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| Arkansas | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| California | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
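One thing worth checking after a replacement like this is that no other value survived the mapping (a third party's initial, for instance, would pass through `replace` silently). A minimal sketch of that check, on a hypothetical two-state toy frame rather than the real data:

```python
import pandas as pd

# toy stand-in for df_clean: two hypothetical states, three elections
toy = pd.DataFrame(
    {"1972": ["R", "R"], "1976": ["D", "R"], "1980": ["R", "D"]},
    index=["StateA", "StateB"],
)

# same mapping as above: D -> 0, R -> 1, done in a single replace call
toy_numerical = toy.replace({"D": 0, "R": 1})

# every remaining value should be 0 or 1; anything else would mean
# the source column contained a label the mapping didn't cover
assert toy_numerical.isin([0, 1]).all().all()
```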
Then, I standardized the data: I centered it so that each column's mean was 0, and scaled it so that each column's variance was 1. I then stored my results in df_standardized.
df_mean = np.mean(df_numerical.reset_index(drop=True), axis = 0)
df_std = np.std(df_numerical.reset_index(drop=True), axis = 0)
display("mean", df_mean.head(), "std", df_std.head())
'mean'
1972    0.960784
1976    0.529412
1980    0.862745
1984    0.960784
1988    0.784314
dtype: float64
'std'
1972    0.194108
1976    0.499134
1980    0.344116
1984    0.194108
1988    0.411298
dtype: float64
df_standardized = (df_numerical.reset_index(drop=True) - df_mean) / df_std
df_standardized.head()
| 1972 | 1976 | 1980 | 1984 | 1988 | 1992 | 1996 | 2000 | 2004 | 2008 | 2012 | 2016 | 2020 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.202031 | -1.060660 | 0.398862 | 0.202031 | 0.524404 | 1.354006 | 1.297771 | 0.836660 | 0.803219 | 1.148121 | 1.060660 | 0.836660 | 1.019804 |
| 1 | 0.202031 | 0.942809 | 0.398862 | 0.202031 | 0.524404 | 1.354006 | 1.297771 | 0.836660 | 0.803219 | 1.148121 | 1.060660 | 0.836660 | 1.019804 |
| 2 | 0.202031 | 0.942809 | 0.398862 | 0.202031 | 0.524404 | 1.354006 | -0.770552 | 0.836660 | 0.803219 | 1.148121 | 1.060660 | 0.836660 | -0.980581 |
| 3 | 0.202031 | -1.060660 | 0.398862 | 0.202031 | 0.524404 | -0.738549 | -0.770552 | 0.836660 | 0.803219 | 1.148121 | 1.060660 | 0.836660 | 1.019804 |
| 4 | 0.202031 | 0.942809 | 0.398862 | 0.202031 | 0.524404 | -0.738549 | -0.770552 | -1.195229 | -1.244990 | -0.870988 | -0.942809 | -1.195229 | -0.980581 |
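Before moving on, it's easy to confirm that this transformation really produces zero-mean, unit-variance columns. A sketch on a hypothetical toy matrix (note the `ddof` subtlety: `np.std` uses the population convention `ddof=0`, while pandas' `.std()` defaults to `ddof=1`, so the two disagree unless you say otherwise):

```python
import numpy as np
import pandas as pd

# toy binary matrix standing in for df_numerical (hypothetical values)
toy = pd.DataFrame({"1972": [1, 1, 0, 1], "1976": [0, 1, 0, 0]})

# standardize with the population std (ddof=0), matching np.std above
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# each column should now have mean ~0 and variance ~1
assert np.allclose(standardized.mean(), 0)
assert np.allclose(standardized.std(ddof=0), 1)
```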
I now had my data in a nice and tidy centered and scaled format. Phew! I was now ready to do PCA.
In the following part, I computed the SVD of df_standardized:
I stored the $U$, $\Sigma$, and $V^T$ matrices in u, s, and vt respectively. This took a single line of code, using the np.linalg.svd function with the full_matrices argument set to False.
u, s, vt = np.linalg.svd(df_standardized, full_matrices=False)
u, s, vt
(array([[-0.16256283, 0.07765483, 0.00122229, ..., -0.0134597 ,
0.08820748, -0.08217091],
[-0.16706061, -0.0162679 , -0.12628238, ..., -0.04426333,
-0.01355807, -0.07427217],
[-0.09532553, -0.04755254, -0.03905914, ..., 0.45011822,
0.13814776, 0.20684122],
...,
[-0.03836168, 0.28942847, 0.23831932, ..., -0.08179345,
-0.12580385, 0.02809131],
[ 0.13046354, 0.02585922, 0.16202611, ..., 0.42873974,
-0.07025696, -0.0487704 ],
[-0.16706061, -0.0162679 , -0.12628238, ..., -0.04426333,
-0.01355807, -0.07427217]]),
array([18.07691752, 10.25583887, 7.64572873, ..., 2.80257708,
2.26892781, 1.70403666]),
array([[-0.136002 , -0.04058259, -0.15155429, ..., -0.35596474,
-0.32690835, -0.33987784],
[-0.3492229 , -0.48079419, -0.4675855 , ..., 0.18624495,
0.15138213, 0.15659938],
[ 0.41610347, -0.48658899, -0.02929042, ..., 0.03473676,
0.22737918, 0.113887 ],
...,
[ 0.04435877, -0.04309004, 0.18776949, ..., -0.04299605,
0.49573543, -0.77479348],
[ 0.01144521, -0.11524943, 0.04285712, ..., -0.05855334,
-0.11179211, -0.16276473],
[ 0.01003386, 0.00671822, 0.03591355, ..., 0.80093148,
-0.07451528, -0.13458479]]))
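It's worth knowing what `full_matrices=False` buys here: for an $n \times d$ matrix with $n > d$, it returns the "economy" factorization with shapes $(n, d)$, $(d,)$, and $(d, d)$, and the product $U \Sigma V^T$ still reconstructs the input exactly. A quick sketch on random stand-in data (not the election table):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))  # stand-in for df_standardized

u, s, vt = np.linalg.svd(X, full_matrices=False)

# economy shapes: (8, 5), (5,), (5, 5) instead of a full 8x8 U
assert u.shape == (8, 5) and s.shape == (5,) and vt.shape == (5, 5)

# U * diag(s) * V^T reconstructs X
assert np.allclose(u @ np.diag(s) @ vt, X)

# singular values come back sorted in decreasing order
assert np.all(np.diff(s) <= 0)
```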
Using my results from the previous part, I created a new first_2_pcs dataframe that contained exactly the first two columns of the principal components matrix. The first column was labeled pc1 and the second column was labeled pc2, and the index was left as the default numerical range index.
first_2_pcs = df_standardized @ vt[0:2, :].T
first_2_pcs.rename(columns = {0: "pc1", 1: "pc2"}, inplace = True)
first_2_pcs.head()
| pc1 | pc2 | |
|---|---|---|
| 0 | -2.938635 | 0.796415 |
| 1 | -3.019941 | -0.166841 |
| 2 | -1.723192 | -0.487691 |
| 3 | -1.698397 | 0.807836 |
| 4 | 2.376794 | -1.807335 |
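A useful identity behind this projection: since $X = U \Sigma V^T$ and $V$ is orthonormal, $X V = U \Sigma$, so projecting onto the first two rows of $V^T$ (as above) gives the same numbers as scaling the first two columns of $U$ by the first two singular values. A sketch on random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))  # stand-in for df_standardized
u, s, vt = np.linalg.svd(X, full_matrices=False)

# two equivalent ways to get the first two principal components:
pcs_via_v = X @ vt[:2, :].T    # project X onto the first two rows of V^T
pcs_via_us = u[:, :2] * s[:2]  # or scale the first two columns of U

assert np.allclose(pcs_via_v, pcs_via_us)
```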
Below I plotted the 1st and 2nd principal components of the 50 states + Washington DC.
sns.scatterplot(data = first_2_pcs, x = "pc1", y = "pc2");
Unfortunately, I had two problems: (1) many states landed on nearly identical coordinates, so the scatterplot suffered from overplotting; and (2) there was no way to tell which point corresponded to which state.
I started by addressing problem 1. To reduce overplotting, I created a new dataframe first_2_pcs_jittered with a small amount of random noise added to each principal component — drawn with np.random.normal, with mean 0 and a standard deviation well below 1 — and then recreated the scatterplot:
# first, jitter the data
first_2_pcs_jittered = first_2_pcs + np.random.normal(0, 0.1, first_2_pcs.shape)
# then, I created a scatter plot
sns.scatterplot(data = first_2_pcs_jittered, x = "pc1", y = "pc2");
To address problem 2, I turned to Plotly. Below I created a scatter plot of the jittered data, with state names shown on hover.
# get the state names from the numerical dataframe's index
# (df_standardized's index was reset, so it no longer holds the names)
first_2_pcs_jittered['state'] = df_numerical.index
# show state names on hover
fig = px.scatter(first_2_pcs_jittered, x="pc1", y="pc2",
                 hover_data={"state": True})
fig.show()
The outlying point in the far corner turned out to be Washington D.C.; checking row 8 of the table confirmed it voted Democrat in every one of these elections:
df_numerical.reset_index().T[8]
State    D.C.
1972        0
1976        0
1980        0
1984        0
1988        0
1992        0
1996        0
2000        0
2004        0
2008        0
2012        0
2016        0
2020        0
Name: 8, dtype: object
Analyzing the plot above, I could see that:
1. An example of a tight cluster was [North Dakota, Wyoming, Oklahoma, and Nebraska] grouping closely with [Alaska, South Dakota, Idaho, and Kansas]. This didn't surprise me, as similarly leaning states (in this case conservative) were expected to cluster together.
2. Something interesting I also observed was that conservative states had been less susceptible to voting for a Democrat, whereas some states flipped from Republican to Democrat around the end of the '80s. What happened there? Reaganism?
I could also look at the contribution of each year's election results to the values of our principal components.
Below, I defined the plot_pc function that plotted and labeled the rows of $V^T$. I then called this function to plot the 1st row of $V^T$, i.e., the row of $V^T$ that corresponds to pc1.
def plot_pc(col_names, row_mat_vt, k):
plt.bar(col_names, row_mat_vt[k, :], alpha=0.7)
plt.xticks(col_names, rotation=90);
plt.figure(figsize=(12, 4)) # adjusts size of plot
plot_pc(list(df_standardized.columns), vt, 0);
Then I plotted the 2nd row of $V^T$, i.e., the row of $V^T$ that corresponds to pc2.
plt.figure(figsize=(12, 4))
plot_pc(list(df_standardized.columns), vt, 1);
Then, using the two prior plots of the rows of $V^T$ as well as the original table, I gave a description of what it meant for a point to be in the top-right quadrant of the 2-D scatter plot.
In other words, what was generally true about a state with a relatively large positive value for pc1 (right side of the 2-D scatter plot)? And for a large positive value for pc2 (top side)? A large positive pc1 meant the state leaned Democrat with very little variation across elections — D.C. being the extreme example. pc2, in turn, was the direction orthogonal to pc1 that captured the most of the remaining variance.
To get a better sense of whether my 2D scatterplot captured the whole story, I created a scree plot for this data. In other words, I plotted the fraction of the total variance (y-axis) captured by the ith principal component (x-axis).
df_standardized.shape[1]
13
np.array(range(len(s)))
array([ 0, 1, 2, ..., 10, 11, 12])
plt.figure(figsize=(6, 3))
plt.xticks(np.array(range(len(s))))
plt.title("Scree Plot")
plt.xlabel("Principal Component")
plt.ylabel("Fraction of Total Variance")
# s_i^2 / sum_j s_j^2 is the fraction of total variance captured by PC i
plt.plot(range(len(s)), np.square(s) / np.sum(np.square(s)));
From the scree plot above, I could see that the first two principal components captured a large portion of the variance in the cleaned data. It was partially for this reason that the 2D scatter plot of principal components was easy to interpret.
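The "fraction of total variance" computation behind the scree plot can be sketched on random stand-in data (not the election table): the squared singular values, normalized to sum to 1, are the per-component variance fractions, and their cumulative sum shows how much a k-dimensional projection retains.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 13))
X = X - X.mean(axis=0)  # centered, like df_standardized

# only the singular values are needed for a scree plot
s = np.linalg.svd(X, compute_uv=False)

variance_fraction = np.square(s) / np.sum(np.square(s))
cumulative = np.cumsum(variance_fraction)

assert np.isclose(cumulative[-1], 1.0)  # all PCs together capture everything
assert np.all(np.diff(variance_fraction) <= 1e-12)  # fractions non-increasing
```

For the election data, `cumulative[1]` would be the number backing the claim above: the share of total variance visible in the 2-D scatterplot.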